This week we’re going to be discussing meta-analyses. Regardless of whether you intend to conduct a meta-analysis in the future, you will absolutely encounter them in the literature when conducting literature reviews (for a project or paper). It’s therefore useful to know how meta-analyses are typically performed so that you can interpret their results effectively.
For today’s section we’ll replicate a meta analysis about climate change behavior interventions:
https://www.nature.com/articles/s41467-019-12457-2
This paper (Nisa, Bélanger, Schumpe, & Faller, 2019) conducts a meta-analysis to summarize and quantify the effectiveness of different interventions—all tested using randomized controlled trials (RCTs)—designed to promote household behaviors that mitigate climate change. The data and R code (albeit as a PDF for some reason…) are shared openly for this paper, which makes it a convenient example to work with here. We’ll start by loading in the packages and data:
require(pacman)
p_load(meta, metafor, readxl, tidyverse, rstudioapi)
current_path <- getActiveDocumentContext()$path ## gets path for this R script
setwd(dirname(current_path)) ## sets working directory to folder this R script is in
data <- read_excel('Dataset Nisa et al. 2019 NComms.xlsx',
sheet = 'Revised_database_for_R', skip = 1)
# cleaning up some column names with troublesome characters
data <- data %>%
rename(Small_N_0_100 = `Small_N_0-100`,
Medium_N_101_500 = `Medium_N_101-500`,
Large_N_500 = `Large_N_500+`,
Self_Selection = `Self-Selection`,
Water_Towel = `Water+Towel`,
Social_comparison = `Social comparison`)
I’ve had to update the R code slightly in places to get it to run, but I’ll include the code throughout as we move through the workflow.
The general workflow for conducting a meta analysis has four parts:
1. Identify your research question and the relevant literature.
2. Screen and select studies using explicit inclusion criteria.
3. Extract the relevant data from each study in a standardized way.
4. Analyze the extracted effects (quantifying their heterogeneity).
Only the last step here is actually quantitative! Steps 1-3 are qualitative steps that require you to evaluate a considerable body of literature in detail. This is a process that will take time, so if you are doing a meta analysis yourself, prepare accordingly.
Steps 1-3 are also going to be highly specific to a particular research area or topic depending on your field and the objective of your meta analysis. As a result, we are going to spend most of our time talking about step 4, which is generally concerned with “heterogeneity analysis” – testing how much variance there is in the results across all the considered studies.
The first step in your search for studies to meta-analyze, per the workflow above, is to identify your research question and the pertinent literature for that question.
This step may seem self-evident, but it is not trivial!
You want to have a clear procedure for identifying papers that address the topic at issue in your meta-analytic research question. Often this is done using database searches with pre-defined search phrases, although you may need to include a variety of databases in order to ensure you cover the available literature.
In the Nisa et al. (2019) paper, their identification step involved a structured database search based on a consistent set of search keyword criteria, which yielded a body of papers that they then screened for duplicates. Their identification and selection workflow is diagrammed nicely in their supplemental materials.
As with any analysis, it’s extremely important in meta analysis to be clear about what’s going into the model. Specifically, we want to be deliberate and explicit about the inclusion criteria used to select each study whose results are being compared.
As you are deciding on your inclusion criteria, there are a few important things to keep in mind:
First, this may be obvious, but meta analyses should compare results across empirical studies. The typical objective of a meta analysis is to summarize a set of empirical findings across a selection of studies targeting the same phenomenon in the world.
Additionally, it’s essential that the effects you compare in your analysis are drawn from studies that are sufficiently similar in design and analysis that the effects they obtain are comparable. It’s true that the beauty of effect sizes—particularly standardized effect sizes—is their cross-context comparability. But two studies are not necessarily comparable simply because they both produce results in the form of correlation coefficients or standardized beta parameters.
This does not mean your studies have to be replications of one another! But it is important that you have substantive, consistent criteria that justify grouping all your selected studies together for a joint analysis. For example, you might select a set of studies that all use a particular instrument or metric (e.g., a race IAT).
For our Nisa example paper, this selection screening had a number of experimental design requirements, namely that all the studies considered involved RCTs. They describe these steps clearly in their methods section:
Initial searches held 29,820 electronic records. After duplicates were removed (N = 13,562), at a first stage of screening, the criterion of experimental design eliminated 6376 records, whereas the criterion of observed behavioural outcomes removed an additional 8593 records. A more extensive screening was performed on a working sample of 1289 papers. A final selection at this stage was based on removing quasi-experimental or pretest–posttest studies without a control group, papers without proper statistics reported and studies reporting group-level changes (e.g., per buildings, student halls, employee departments) without any link to individual-level behaviour changes.
The final sample according to the inclusion criteria covered 83 studies, providing a total of 157 estimates. Not all estimates were unique, meaning several originated from the same dataset, which violates independence of observations and induces same data/author bias. These 157 estimates were used only when subgroup comparisons were made. For instance, if the same study reported estimates on the effect of an intervention in walking, cycling and car use, only a randomly selected estimate was used to estimate the global effect for transportation, but they were all used if the goal was to estimate the specific effect in walking or cycling. Only 144 estimates are unique, that is, from independent datasets.
Even with their clear criteria, actually making these inclusion/exclusion decisions can be subjective (and time consuming!). The authors here dealt with this by having an author and two RAs independently review the titles and abstracts of every paper, with decisions validated by consensus across readers.
Once you have all your studies identified, the next step is to extract the relevant data from each study in a consistent, standardized way.
The most important part of this step is deciding on your method of standardization. In order to compare effects across studies, you will need to quantify and convert those effects into standardized effect sizes. We’ve talked a bit about effect sizes in the power analysis units, and we will talk more about them this week, but for now we’ll just mention a few important ones for meta analysis that are especially common:
Correlation coefficient \(r\) (often transformed with Fisher \(r\)-to-\(z\) transformation)
Cohen’s \(d\) (or other standardized mean differences, e.g., Hedges’ \(g\))
Odds ratios
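To make these conversions concrete, here is a quick sketch in R (with made-up numbers, not values from the Nisa et al. dataset) of how raw statistics get turned into each of these standardized forms:

```r
# Toy conversions of raw statistics into standardized effect sizes
# (illustrative numbers only, not from the Nisa et al. data)

# Fisher r-to-z transformation of a correlation coefficient
r <- 0.30
z <- atanh(r)  # equivalently 0.5 * log((1 + r) / (1 - r))

# Cohen's d from two group means, using the pooled standard deviation
m_treat <- 12.0; sd_treat <- 4.0; n_treat <- 50
m_ctrl  <- 14.0; sd_ctrl  <- 5.0; n_ctrl  <- 50
sd_pooled <- sqrt(((n_treat - 1) * sd_treat^2 + (n_ctrl - 1) * sd_ctrl^2) /
                  (n_treat + n_ctrl - 2))
d <- (m_treat - m_ctrl) / sd_pooled

# (Log) odds ratio from a 2x2 table of counts
a <- 20; b <- 30    # treatment: event / no event
c <- 10; d2 <- 40   # control:   event / no event
log_or <- log((a * d2) / (b * c))

round(c(fisher_z = z, cohens_d = d, log_OR = log_or), 3)
```

Effects on the log-odds scale can also be converted to \(d\) units (and vice versa) with standard formulas, which is how disparate outcome types end up on one common scale.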
For the Nisa paper, the raw analyses they considered had a wide range of outcome measures and units (the interventions they examine target many different behaviors), so they converted all the outcome measures into Cohen’s \(d\) units to quantify how much a given intervention reduced natural resource usage (i.e., mitigated climate change). These values are coded as SMD for “standardized mean difference”.
data
Another thing that must be factored into any meta-analytic comparison of effect sizes is the sample size of each included study. This is essential because, as we’ve discussed recently in the power units, sample size influences the precision of an estimate. So when the studies included in our meta analysis have different sample sizes, we should not weight them all equally when comparing their effect estimates. Sample sizes are coded as N_EG (experimental group N) and N_CG (control group N) in the data:
data %>% select(Study, N_EG, N_CG, SMD, se)
Implicit in this consideration is also the need to include the standard error for the observed effect in a given study, since this is the actual measure of precision we have for each study. We’ll use this more formally as part of our heterogeneity analysis next.
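As a toy illustration of why the standard errors matter (made-up numbers again, not the Nisa data), inverse-variance weighting downweights imprecise studies when averaging their effects:

```r
# Inverse-variance weighting: a toy pooled estimate from three studies
# (illustrative numbers only)
smd <- c(-0.10, -0.30, 0.05)   # observed standardized mean differences
se  <- c(0.05, 0.20, 0.10)     # their standard errors

w <- 1 / se^2                              # inverse-variance weights
pooled    <- sum(w * smd) / sum(w)         # weighted average effect
pooled_se <- sqrt(1 / sum(w))              # precision of the pooled estimate

c(pooled = pooled, pooled_se = pooled_se)
```

Note how the pooled estimate lands close to the first study’s value: its small standard error gives it by far the largest weight.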
There is often other information that you’ll want to factor in (or “control for” in) your comparison across the studies in your meta analysis. These other factors are typically referred to as moderators in meta analysis.
The other columns in the Nisa data show us some of the moderators they considered. These include things like the duration of the intervention, as well as categorical variables that code what kind of intervention was considered. We might expect that some of the variance between studies’ effects is structured based on these kinds of variables, instead of random due to noise.
Alright, finally we’ve come to the analysis part of the meta analysis!
The main objective of a meta analysis is to quantify heterogeneity in the observed effect across the selected studies.
These differences—in other words, the variance in the effect—could be attributed to differences in the samples or the methods, in ways that are either random or structured (as discussed above).
To model this heterogeneity using the observed effects we have in hand, we need a model that quantifies each observed effect as a linear combination of the true effect and its variance. This basic model structure is:
\[ \theta_i = \mu + \epsilon_i \]
where \(\theta_i\) is the effect size for each study, \(\mu\) is the true (or average) effect, and \(\epsilon_i\) is the error term. The error term is assumed to be normally distributed with variance \(\tau^2\) (i.e., \(\epsilon_i \sim N(0, \tau^2)\)). That error variance \(\tau^2\) is the heterogeneity parameter we’re looking to estimate, alongside the true effect \(\mu\).
To see why, imagine that \(\tau^2 = 0\). In that case, every study effect \(\theta_i\) would have to be identical: each would equal \(\mu\) exactly. Conversely, any deviations from \(\mu\) that we attribute to “noise” are captured by the error variance \(\tau^2\).
Deviations we think are structured due to the influence of moderators can be accounted for with additional additive variables.
Lastly, remember that we’re actually going to estimate \(\mu\) as a weighted average of the \(\theta_i\) set based on the sample sizes of each study.
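In standard random-effects notation (this is the textbook estimator, not anything specific to the Nisa paper), that weighted average is an inverse-variance weighted mean:

\[ \hat{\mu} = \frac{\sum_{i} w_i \theta_i}{\sum_{i} w_i}, \qquad w_i = \frac{1}{se_i^2 + \hat{\tau}^2} \]

where \(se_i\) is the standard error of study \(i\)’s observed effect. Larger studies have smaller standard errors and therefore larger weights, which is how sample size enters the estimate.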
The selection process for our meta analysis produces a set of studies to consider—but is this set of studies exhaustive of the relevant studies for the question at hand?
We generally assume it is not; rather, we assume that our studies are sampled from some hypothetical distribution of possible studies. Consequently, we usually use a random effects model to estimate these parameters.
If we did a fixed effects analysis, we would only be able to make inferences about the specific studies we included. But if we assume that the studies we include are just a random sample of the possible studies, a random effects analysis allows us to make inferences about the larger population of possible studies.
The most basic test for heterogeneity is to perform an inference test as to whether \(\tau^2 = 0\). This is known as Cochran’s \(Q\)-test.
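The \(Q\) statistic itself is computed from the inverse-variance weighted deviations of each study’s effect from the pooled estimate:

\[ Q = \sum_{i=1}^{k} w_i (\theta_i - \hat{\mu})^2, \qquad w_i = \frac{1}{se_i^2} \]

Under the null hypothesis of homogeneity (\(\tau^2 = 0\)), \(Q\) approximately follows a \(\chi^2\) distribution with \(k - 1\) degrees of freedom, where \(k\) is the number of studies.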
In addition to estimating \(\tau^2\), typically done using a regression technique called metaregression, there are other indices for heterogeneity that are often used, most commonly \(I^2\) and \(H^2\). Simply put: \(I^2\) describes the proportion of the total variation across studies that is due to heterogeneity rather than sampling error, while \(H^2\) is the ratio of the total variability in the observed effects to the sampling variability alone.
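Both indices can be derived directly from Cochran’s \(Q\): \(H^2 = Q / df\) and \(I^2 = (Q - df)/Q\). As a sanity check, we can recompute them from the \(Q = 403.44\) and \(df = 143\) reported in the metagen() output later in this section:

```r
# Recovering I^2 and H from Cochran's Q and its degrees of freedom
# (Q and df taken from the metagen() output in this section)
Q  <- 403.44
df <- 143

H2 <- Q / df                        # total variability / sampling variability
I2 <- max(0, (Q - df) / Q) * 100    # % of total variation due to heterogeneity

round(c(I2 = I2, H = sqrt(H2)), 2)
```

These match the reported \(I^2 = 64.6\%\) and \(H = 1.68\) up to rounding of \(Q\).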
The Nisa paper uses \(I^2\) as its indicator of heterogeneity, reporting it throughout the paper alongside the Cohen’s \(d\) estimate for each effect considered.
There are benchmarks for \(I^2\) as with most standardized effects. They are:
0% to 40%: might not be important;
30% to 60%: may represent moderate heterogeneity;
50% to 90%: may represent substantial heterogeneity;
75% to 100%: considerable heterogeneity.
The main model in the Nisa paper is a random effects model, which they fit using their pre-calculated Cohen’s \(d\) values for each observed effect. (Comments added to explain each parameter in the model fit call are mine.)
# Random-effects meta-analysis with pre-calculated estimates (SMD) and using DerSimonian and Laird method
m.hksj <- metagen(TE = data$SMD, ## treatment effects vector
seTE = data$se, ## treatment SE vector
data = data, ## data argument
studlab = paste(data$Study), ## labels
fixed = FALSE, ## no fixed effects
random = TRUE, ## yes random effects
method.tau = "DL", ## DerSimonian & Laird method
hakn = TRUE, ## Hartung & Knapp CI adjustment
prediction = TRUE, ## print prediction interval
sm = "SMD") ## indicates SMD as summary measure type
m.hksj
## Number of studies combined: k = 144
##
## SMD 95%-CI t p-value
## Random effects model (HK) -0.0932 [-0.1230; -0.0634] -6.18 < 0.0001
## Prediction interval [-0.1819; -0.0045]
##
## Quantifying heterogeneity:
## tau^2 = 0.0019 [0.0119; 0.0510]; tau = 0.0438 [0.1091; 0.2259]
## I^2 = 64.6% [57.7%; 70.3%]; H = 1.68 [1.54; 1.83]
##
## Test of heterogeneity:
## Q d.f. p-value
## 403.44 143 < 0.0001
##
## Details on meta-analytical method:
## - Inverse variance method
## - DerSimonian-Laird estimator for tau^2
## - Jackson method for confidence interval of tau^2 and tau
## - Hartung-Knapp (HK) adjustment for random effects model (df = 143)
## - Prediction interval based on t-distribution (df = 142)
Estimates show that \(\tau^2 =\) 0.0019, and that this is significantly different from \(\tau^2 = 0\) according to the \(Q\) test. In other words, there is heterogeneity in the effects of different interventions on household climate mitigation.
Further, the \(I^2\) value of 64.6% tells us that the proportion of between-study variation not attributable to sampling variance may be “substantial” according to our benchmarks.
Finally, the random effects model’s estimate of the (weighted) average effect size is \(d =\) -0.093, which indicates a very small effect.
The authors note that the high \(I^2\) in combination with the variety of intervention types used in their study may suggest the need to account for these intervention types as moderators in metaregression.
The first metaregression they describe in the results is a simple test to see whether intervention duration explains any of the variance in the observed effects. The R code for this analysis isn’t included in their supplement, but I recreated it below:
# Metaregression testing for intervention duration as a moderator
metareg(m.hksj, ~Duration_intervention_days)
##
## Mixed-Effects Model (k = 144; tau^2 estimator: DL)
##
## tau^2 (estimated amount of residual heterogeneity): 0.0028 (SE = 0.0019)
## tau (square root of estimated tau^2 value): 0.0528
## I^2 (residual heterogeneity / unaccounted variability): 64.78%
## H^2 (unaccounted variability / sampling variability): 2.84
## R^2 (amount of heterogeneity accounted for): 0.00%
##
## Test for Residual Heterogeneity:
## QE(df = 142) = 403.2242, p-val < .0001
##
## Test of Moderators (coefficient 2):
## F(df1 = 1, df2 = 142) = 3.7019, p-val = 0.0564
##
## Model Results:
##
## estimate se tval df pval ci.lb
## intrcpt -0.1241 0.0189 -6.5738 142 <.0001 -0.1614
## Duration_intervention_days 0.0001 0.0001 1.9240 142 0.0564 -0.0000
## ci.ub
## intrcpt -0.0868 ***
## Duration_intervention_days 0.0002 .
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results here show that intervention duration does not have a significant effect. (Note that the inference test fitted in my code here gives a different \(p\)-value than the one in the paper, but the estimate is the same—I’m not sure why this is, but since they don’t provide their code for this I can’t compare.)
The authors fit additional metaregressions testing for other potential moderators, which you can review in the paper. Below I fit one “omnibus” metaregression that simply includes every moderator variable in their table. This isn’t necessarily the best way to do this, but it shows one way such a model could look:
# Get the list of moderator variable names and format as formula
moderators <- colnames(data)[7:28]
mod_form <- as.formula(paste("~", paste(moderators, collapse = "+")))
# Fit metaregression with additive moderators
m.metareg_full <- metareg(m.hksj, mod_form)
## Warning in find(i): elements of 'what' after the first will be ignored
## Warning in find(i): elements of 'what' after the first will be ignored
## Warning in find(i): elements of 'what' after the first will be ignored
## Warning in find(i): elements of 'what' after the first will be ignored
## Warning in find(i): elements of 'what' after the first will be ignored
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
## Consider formula(paste(x, collapse = " ")) instead.
## Warning: Redundant predictors dropped from the model.
summary(m.metareg_full)
##
## Mixed-Effects Model (k = 144; tau^2 estimator: DL)
##
## logLik deviance AIC BIC AICc
## 9.1537 236.9349 21.6926 81.0889 28.5219
##
## tau^2 (estimated amount of residual heterogeneity): 0.0024 (SE = 0.0016)
## tau (square root of estimated tau^2 value): 0.0487
## I^2 (residual heterogeneity / unaccounted variability): 42.88%
## H^2 (unaccounted variability / sampling variability): 1.75
## R^2 (amount of heterogeneity accounted for): 0.00%
##
## Test for Residual Heterogeneity:
## QE(df = 125) = 218.8236, p-val < .0001
##
## Test of Moderators (coefficients 2:19):
## F(df1 = 18, df2 = 125) = 4.7087, p-val < .0001
##
## Model Results:
##
## estimate se tval df pval ci.lb
## intrcpt -0.2416 0.6878 -0.3513 125 0.7260 -1.6028
## Households -0.0227 0.0576 -0.3929 125 0.6950 -0.1367
## Small_N_0_100 -0.2053 0.0616 -3.3343 125 0.0011 -0.3272
## Medium_N_101_500 -0.0379 0.0528 -0.7166 125 0.4749 -0.1424
## Self_Selection -0.0455 0.0478 -0.9527 125 0.3426 -0.1401
## Region -0.0028 0.0203 -0.1393 125 0.8895 -0.0429
## Electricity 0.0313 0.6754 0.0463 125 0.9632 -1.3054
## Appliances -0.0045 0.0863 -0.0521 125 0.9585 -0.1752
## Water 0.0675 0.0749 0.9013 125 0.3692 -0.0808
## Water_Towel 0.0161 0.6789 0.0238 125 0.9811 -1.3275
## Meat 0.1264 0.6925 0.1825 125 0.8555 -1.2441
## Foodwaste 0.0538 0.7008 0.0768 125 0.9389 -1.3331
## Recycling -0.1721 0.6803 -0.2530 125 0.8007 -1.5184
## type_transport 0.0004 0.0672 0.0066 125 0.9947 -0.1325
## Information 0.2261 0.1212 1.8649 125 0.0645 -0.0138
## Appeals 0.1972 0.1334 1.4786 125 0.1418 -0.0668
## Engagement 0.1912 0.1208 1.5819 125 0.1162 -0.0480
## Social_comparison 0.1382 0.1191 1.1603 125 0.2481 -0.0975
## Duration_intervention_days 0.0001 0.0001 0.7139 125 0.4766 -0.0001
## ci.ub
## intrcpt 1.1196
## Households 0.0914
## Small_N_0_100 -0.0834 **
## Medium_N_101_500 0.0667
## Self_Selection 0.0490
## Region 0.0373
## Electricity 1.3680
## Appliances 0.1662
## Water 0.2158
## Water_Towel 1.3598
## Meat 1.4969
## Foodwaste 1.4408
## Recycling 1.1743
## type_transport 0.1334
## Information 0.4660 .
## Appeals 0.4611
## Engagement 0.4303
## Social_comparison 0.3738
## Duration_intervention_days 0.0002
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A few other analyses are fitted and reported in the paper that are fairly common for meta analyses. In particular, these include the funnel plot and the forest plot, which help visualize some of the quantitative results of the meta analysis beyond just describing the heterogeneity with a summary statistic.
The funnel plot shows the relationship between the observed effects and their precision (i.e., their standard errors). If our sample of studies is unbiased, the effects should scatter symmetrically around the mean estimate, with the scatter widening as the standard error increases (producing the namesake funnel shape).
If the scatter is not symmetric, it may indicate a “small-study bias”, where lower-precision estimates systematically deviate to one side of the overall effect.
In the Nisa meta analysis, they observe such a bias: the low-precision estimates tend to show larger negative effects disproportionately.
## Small-study bias
funnel(m.hksj, xlab = 'Effect estimate', col = 'black', cex = 1.5,
hlines = 'red', bg = 'black', ylim = c(1.1, 0),
contour = c(.95, .975, .99), col.contour = c('darkgray', 'gray', 'lightgray'))
legend(1.4, 0, c("Studies", "p < 0.01", "p < 0.025", "p < 0.05", "p > 0.05"), bty = "o",
       pch = c(16, NA, NA, NA, NA), fill = c(NA, "lightgray", "gray", "darkgray", "white"),
       bg = 'white', border = "black")
Finally, the most common type of plot shown in a meta analysis is a forest plot. These plots show the distribution of effect sizes around the estimated “true” effect from the random effects model, with inference parameters from the model that integrate the weighting done using the standard errors from the observed effects.
forest(m.hksj)
You can also do a cumulative meta analysis, where the estimated true effect tracks the accumulation of evidence over time; this is what is done in the Nisa paper. I reproduced their code for this plot below with some small edits—it looks aesthetically a bit different from what’s in the paper, but it captures the same information and overall trend.
# Fit a simple random-effects meta-analytic model
# (rma() takes sampling variances via `vi` or standard errors via `sei`;
#  the data provide standard errors, so we pass them as `sei`)
random <- rma(yi = data$SMD, sei = data$se, method = "DL", test = "t",
              slab = paste(data$Study))
tmp <- cumul(random, order = as.factor(data$Year), transf = TRUE)
forest(tmp, xlim = c(-4, 2), at = c(-2, -1, -0.5, -.1, 0),
digits = c(2, 2), cex = 0.5, yaxt = "n")